{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Open Domain Question Answering (ODQA)\n",
    "\n",
    "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/deeppavlov/DeepPavlov/blob/master/docs/features/models/ODQA.ipynb)\n",
    "\n",
    "# Table of contents \n",
    "\n",
    "1. [Introduction to the task](#1.-Introduction-to-the-task)\n",
    "\n",
    "2. [Get started with the model](#2.-Get-started-with-the-model)\n",
    "\n",
    "3. [Models list](#3.-Models-list)\n",
    "\n",
    "4. [Use the model for prediction](#4.-Use-the-model-for-prediction)\n",
    "\n",
    "    4.1 [Predict using Python](#4.1-Predict-using-Python)\n",
    "\n",
    "    4.2 [Predict using CLI](#4.2-Predict-using-CLI)\n",
    "\n",
    "5. [Customize the model](#5.-Customize-the-model)\n",
    "\n",
    "    5.1 [Description of config parameters](#5.1-Description-of-config-parameters)\n",
    "    \n",
    "    5.2 [Building the index and training the reader model](#5.2-Building-the-index-and-training-the-reader-model)\n",
    "\n",
    "# 1. Introduction to the task\n",
    "\n",
    "**Open Domain Question Answering (ODQA)** is a task to find an exact answer\n",
    "to any question in **Wikipedia** articles. Thus, given only a question, the system outputs\n",
    "the best answer it can find.\n",
    "The default ODQA implementation takes a batch of queries as input and returns the best answer.\n",
    "\n",
    "English ODQA version consists of the following components:\n",
    "\n",
    "- TF-IDF ranker, which defines top-N most relevant paragraphs in TF-IDF index;\n",
    "- Binary Passage Retrieval (BPR) ranker, which defines top-K most relevant in binary index;\n",
    "- a database of paragraphs (by default, from Wikipedia) which finds N + K most relevant paragraph text by IDs, defined by TF-IDF and BPR ranker;\n",
    "- Reading Comprehension component, which finds answers in paragraphs and defines answer confidences.\n",
    "\n",
    "Russian ODQA version performs retrieval only with TF-IDF index.\n",
    "\n",
    "Binary Passage Retrieval is resource-efficient the method of building a dense passage index. The dual encoder (with BERT or other Tranformer as backbone) is trained on question answering dataset (Natural Questions in our case) to maximize dot product of question and passage with answer embeddings and minimize otherwise. The question or passage embeddings are obtained the following way: vector of BERT CLS-token is fed into a dense layer followed by a hash function which turns dense vector into binary one.\n",
    "\n",
    "# 2. Get started with the model\n",
    "\n",
    "First make sure you have the DeepPavlov Library installed.\n",
    "[More info about the first installation.](http://docs.deeppavlov.ai/en/master/intro/installation.html)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -q deeppavlov"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The example below is given for basic ODQA config [en_odqa_infer_wiki](https://github.com/deeppavlov/DeepPavlov/blob/1.1.1/deeppavlov/configs/odqa/en_odqa_infer_wiki.json).\n",
    "Check what [other ODQA configs](#3.-Models-list) are available and simply replace `en_odqa_infer_wiki`\n",
    "with the config name of your preference. [What is a Config File?](https://docs.deeppavlov.ai/en/master/intro/configuration.html)\n",
    "\n",
    "Before using the model make sure that all required packages are installed running the command:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m deeppavlov install en_odqa_infer_wiki"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "There are alternative ways to install the model's packages that do not require executing a separate command -- see the options in the next sections of this page.\n",
    "\n",
    "# 3. Models list\n",
    "\n",
    "The table presents a list of all of the ODQA models available in the DeepPavlov Library.\n",
    "\n",
    "| Config | Description |\n",
    "| :--- | :--- |\n",
    "| odqa/en_odqa_infer_wiki.json | Basic config for **English** language. Consists of of Binary Passage Retrieval, TF-IDF retrieval and reader. |\n",
    "| odqa/en_odqa_pop_infer_wiki.json | Extended config for **English** language. Consists of of Binary Passage Retrieval, TF-IDF retrieval, popularity ranker and reader. |\n",
    "| odqa/ru_odqa_infer_wiki.json | Basic config for **Russian** language. Consists of TF-IDF ranker and reader. |\n",
    "\n",
    "The table presents the scores on Natural Questions and SberQuAD dataset and memory consumption.\n",
    "\n",
    "| Config | Number of<br>paragraphs | Dataset | F1 | EM | RAM | GPU | Time for <br> 1 query |\n",
    "| :--- | :---: | :--- | :---: | :---: | :---: | :---: | :---: |\n",
    "| odqa/en_odqa_infer_wiki.json | 200 | Natural Questions | 45.2 | 37.0 | 10.4 | 2.4 | 4.9 s |\n",
    "| odqa/ru_odqa_infer_wiki.json | 100 | SberQuAD | 59.2 | 49.0 | 13.1 | 5.3 | 2.0 s |\n",
    "\n",
    "# 4. Use the model for prediction\n",
    "\n",
    "## 4.1 Predict using Python\n",
    "\n",
    "### English"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import build_model\n",
    "\n",
    "odqa_en = build_model('en_odqa_infer_wiki', download=True, install=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Input**: List[questions]\n",
    "\n",
    "**Output**: Tuple[List[answers], List[answer scores], List[answer places in paragraph]]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['Luke Skywalker'], [4.196979999542236]]\n"
     ]
    }
   ],
   "source": [
    "odqa_en([\"What is the name of Darth Vader's son?\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Russian"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from deeppavlov import build_model\n",
    "\n",
    "odqa_ru = build_model('ru_odqa_infer_wiki', download=True, install=True)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[['на востоке и юге Австралии'], [0.9999760985374451]]\n"
     ]
    }
   ],
   "source": [
    "odqa_ru([\"Где живут кенгуру?\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.2 Predict using CLI\n",
    "\n",
    "You can also get predictions in an interactive mode through CLI (Сommand Line Interface)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "! python -m deeppavlov interact en_odqa_infer_wiki -d"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "`-d` is an optional download key (alternative to `download=True` in Python code). The key `-d` is used to download the pre-trained model along with embeddings and all other files needed to run the model.\n",
    "\n",
    "# 5. Customize the model\n",
    "\n",
    "## 5.1 Description of config parameters\n",
    "\n",
    "Parameters of ``bpr`` component:\n",
    "    \n",
    "- ``load_path`` - path with checkpoint of query encoder and bpr index;\n",
    "- ``query_encoder_file`` - filename of query encoder (Transformer-based model which takes a question as input and obtains its binary embedding);\n",
    "- ``bpr_index`` - filename with BPR index (matrix of paragraph binary vectors);\n",
    "- ``pretrained_model`` - Transformer model, used in query encoder;\n",
    "- ``max_query_length`` - maximal length (in sub-tokens) of the input to the query encoder;\n",
    "- ``top_n`` - how many paragraph IDs to return per a question.\n",
    "\n",
    "Parameters of ``tfidf_ranker`` component:\n",
    "\n",
    "- ``top_n`` - how many paragraph IDs to return per a question.\n",
    "\n",
    "Parameters of ``logit_ranker`` component:\n",
    "\n",
    "- ``batch_size`` - the paragraphs from the database (some of which contain the answer to the question, others - do not contain) will be split into batches with the size ``batch_size`` for extraction of candidate answer in each paragraph;\n",
    "- ``squad_model`` - the model which finds spans of an answer in a paragraph;\n",
    "- ``sort_noans`` - whether to put paragraphs with no answer in the end of paragraph list, sorted by confidences;\n",
    "- ``top_n`` - the number of possible answers for a question;\n",
    "- ``return_answer_sentence`` - whether to return the sentence from the paragraph with the answer.\n",
    "\n",
    "## 5.2 Building the index and training the reader model\n",
    "\n",
    "There are two customizable components in ODQA configs:\n",
    "\n",
    "- TF-IDF ranker;\n",
    "- Reading comprehension model.\n",
    "\n",
    "If you would like to build the TF-IDF index for your own text database, read [here](https://docs.deeppavlov.ai/en/master/features/models/tfidf_ranking.html#ranker-training). \n",
    "\n",
    "In addition, to train the Reader on your data, read [here](https://docs.deeppavlov.ai/en/master/features/models/SQuAD.html#4.1-Train-your-model-from-Python)."
   ]
  }
 ],
 "metadata": {},
 "nbformat": 4,
 "nbformat_minor": 4
}
